Quantitative linguistics has provided us with a number of empirical laws that characterise 
the evolution of languages [1, 2, 3] and competition amongst them [4]. In terms of language
usage, one of the most influential results is Zipf's law of word frequencies [5]. Zipf's law 
appears to be universal, and may not even be unique to human language [6]. However, there 
is ongoing controversy over whether Zipf's law is a good indicator of complexity [7, 8]. Here 
we present an alternative approach that puts Zipf's law in the context of critical phenomena 
(the cornerstone of complexity in physics [9]) and establishes the presence of a large-scale 
"attraction" between successive repetitions of words. Moreover, this phenomenon is scale- 
invariant and universal - the pattern is independent of word frequency and is observed in 
texts by different authors and written in different languages. There is evidence, however, that 
the shape of the scaling relation changes for words that play a key role in the text, implying 
the existence of different "universality classes" in the repetition of words. These behaviours 
exhibit striking parallels with complex catastrophic phenomena [10, 11, 12]. 

Zipf's law states that the relative frequency $f_w$ of any word $w$ in a text is approximately
related to its rank $r_w$ according to an inverse power law [5]:

$$f_w \propto \frac{1}{r_w^{\alpha}}. \qquad (1)$$

The rank $r_w$ of a word (its position in the list of words ordered by decreasing frequency) measures
its rarity. In practice the exponent $\alpha$ is usually close to one. Note that Zipf's law describes a
static property of language; shuffling pages, for instance, would not affect its validity. 
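
As an illustration of Eq. (1), the following minimal sketch builds the rank-frequency table of a tokenised text and estimates the exponent by a straight-line fit in log-log coordinates. The function names and the plain least-squares fit are assumptions made for this example; the text does not say how the exponent was estimated.

```python
import math
from collections import Counter

def rank_frequency(tokens):
    """Return (word, rank, relative frequency) triples, most frequent first."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Ties in frequency are broken alphabetically so the ranking is reproducible.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(w, r, n / total) for r, (w, n) in enumerate(ordered, start=1)]

def zipf_exponent(table):
    """Least-squares slope of log f against log r, returned as alpha = -slope."""
    xs = [math.log(r) for _, r, _ in table]
    ys = [math.log(f) for _, _, f in table]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return -sxy / sxx

# Toy "text" whose word counts are exactly proportional to 1/rank.
toy = ["a"] * 60 + ["b"] * 30 + ["c"] * 20 + ["d"] * 15 + ["e"] * 12
print(zipf_exponent(rank_frequency(toy)))
```

Because the table depends only on word counts, shuffling the tokens leaves the result unchanged, which is the sense in which Zipf's law is a static property.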

To investigate the dynamical properties of discourse generation, we can study the distance
between consecutive appearances, or "tokens", of a given word $w$. This inter-appearance distance,
denoted by $\ell$, is measured as the total number of words between two tokens plus one (so it takes
on positive values 1, 2, etc.). It is immediately clear that $\ell$ can have a very wide range of values.
In the novel Clarissa by S. Richardson, for example, the word depend appears 99 times with a
mean inter-appearance distance of 9248. The minimum distance is 1, and the maximum is 50476.
Although Zipf's law allows one to estimate the mean inter-appearance distance of each word (by
means of $\bar{\ell}_w \simeq 1/f_w \propto r_w^{\alpha}$), it cannot account for or predict the variability of word recurrence.
In addition, it has been reported that high-frequency words are distributed according to a Poisson
process [13]. This is the simplest stochastic point process, characterised by a total lack of memory.
Nevertheless, even the Poisson process is unable to explain such large variability. 
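
The definition of the inter-appearance distance can be sketched in a few lines. With words indexed by position, "number of intervening words plus one" is simply the difference between consecutive positions of the same word. The function names and the toy sentence are illustrative assumptions, not part of the original analysis.

```python
def inter_appearance_distances(tokens, word):
    """Distances between consecutive tokens of `word`.

    A distance is the number of intervening words plus one, i.e. the
    difference between the two token positions, so adjacent repeats give 1.
    """
    positions = [i for i, t in enumerate(tokens) if t == word]
    return [b - a for a, b in zip(positions, positions[1:])]

def summarize(distances):
    """Return (mean, minimum, maximum) of a non-empty list of distances."""
    return sum(distances) / len(distances), min(distances), max(distances)

text = "the cat saw the the dog and the bird".split()
gaps = inter_appearance_distances(text, "the")
print(gaps, summarize(gaps))
```

A word appearing n times yields n - 1 distances, so the 99 appearances of depend quoted above correspond to 98 values.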

In order to characterize the dispersion in $\ell$, we would normally estimate its probability density.
Due to the size of the dispersion, however, a large number of appearances is necessary to get
significant statistics. But Zipf's law tells us that there are many more low-frequency (high-rank)
words than high-frequency (low-rank) words. Thus, the vast majority of words cannot be used in
such a study, no matter how long the text considered. 

As an alternative, we define a rescaled, "dimensionless", inter-appearance distance: $\theta = \ell/\bar{\ell}_w$.
For each word $w$, $\theta$ measures the inter-appearance distance in units of its own mean value, $\bar{\ell}_w$. We
can then calculate the probability density $D_s(\theta)$ for large sets of words $s$ having similar relative
frequencies $f_w$. We remark that we are not studying the heterogeneous distribution of words in a
collection of texts [14, 15, 16], but rather the properties of word repetitions in an individual text.
In fact, Zipf himself studied this issue using a similar approach for very low-frequency words [5]
and proposed a power law for $D_s(\theta)$. As we shall see, the behaviour of $D_s(\theta)$ is actually much
more complex. 
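
The rescaling step can be sketched as follows: each word's distances are divided by that word's own mean, and the rescaled values of all words in a set are pooled before the density is estimated. The logarithmic binning (and the choice of five bins per decade) is an assumption made for this sketch, chosen because the values span several orders of magnitude; the text does not specify the estimator that was used.

```python
import math
from collections import defaultdict

def rescaled_distances(tokens, words):
    """Pool theta = distance / (that word's mean distance) over a set of words."""
    wanted = set(words)
    positions = defaultdict(list)
    for i, t in enumerate(tokens):
        if t in wanted:
            positions[t].append(i)
    thetas = []
    for pos in positions.values():
        gaps = [b - a for a, b in zip(pos, pos[1:])]
        if not gaps:          # a word seen only once has no distances
            continue
        mean_gap = sum(gaps) / len(gaps)
        thetas.extend(g / mean_gap for g in gaps)
    return thetas

def log_binned_density(values, bins_per_decade=5):
    """Density estimate on logarithmic bins: (geometric bin centre, density)."""
    counts = defaultdict(int)
    for v in values:
        counts[math.floor(math.log10(v) * bins_per_decade)] += 1
    n = len(values)
    density = []
    for k in sorted(counts):
        lo = 10 ** (k / bins_per_decade)
        hi = 10 ** ((k + 1) / bins_per_decade)
        density.append((math.sqrt(lo * hi), counts[k] / (n * (hi - lo))))
    return density

toy = "a b a a b b b a".split()
print(log_binned_density(rescaled_distances(toy, ["a", "b"])))
```

By construction the rescaled distances of every word have mean one, which is what allows words of very different frequency to be pooled in the same set.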

We have analysed eight texts in four languages (for details, see the Methods section). The
curves in Fig. 1 show the probability densities of rescaled inter-appearance distances for all verbs
of Clarissa appearing in their root form. The verbs are collected into six frequency groups $s$, from
about 30 appearances to more than 10000, and $D_s(\theta)$ is plotted for each group in a different point
style. It is clear that the six distributions collapse onto a unique curve. This phenomenon signals
the approximate fulfilment of a law $D_s(\theta) = F(\theta)$, where the scaling function $F$ is independent
of the word set $s$. It is therefore also independent of frequency, except for the smallest distances
($\ell \lesssim 10$). The shape of $F$ approximates a gamma distribution over about five orders of magnitude,

$$F(\theta) = \frac{1}{a\,\Gamma(\gamma)} \left(\frac{a}{\theta}\right)^{1-\gamma} e^{-\theta/a}, \qquad (2)$$

with the parameter $a \simeq 1/\gamma$ (after rescaling, so that the mean of $\theta$ is 1) and $\Gamma$ the Euler gamma
function. A least-squares fit yields $\gamma = 0.60 \pm 0.05$. 
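
A gamma density with shape $\gamma$ and scale $a$ has mean $\gamma a$, so imposing a unit mean gives $a = 1/\gamma$. The sketch below evaluates this density for $\gamma = 0.60$ and compares it with the exponential density of unit mean that a Poisson process would give. It assumes the standard gamma form $\theta^{\gamma-1} e^{-\theta/a} / (a^{\gamma}\Gamma(\gamma))$; the function names are illustrative.

```python
import math

def gamma_scaling(theta, gamma=0.60):
    """Gamma density with shape `gamma` and scale a = 1/gamma (unit mean)."""
    a = 1.0 / gamma
    return theta ** (gamma - 1.0) * math.exp(-theta / a) / (a ** gamma * math.gamma(gamma))

def poisson_density(theta):
    """Exponential density with unit mean, the Poisson-process reference."""
    return math.exp(-theta)

for theta in (1e-4, 1e-3, 1e-2, 1e-1, 1.0, 3.0):
    print(theta, gamma_scaling(theta), poisson_density(theta))
```

For small $\theta$ the gamma density behaves as the power law $\theta^{\gamma-1}$ and lies well above the exponential, while for $\theta$ of order one or larger the exponential factor dominates both curves.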

The validity of this result extends well beyond verbs in Clarissa. The topmost curves in Fig.
2 show inter-appearance distance distributions for adjectives in the same novel, well described by
the scaling function $F$. Remarkably, other works in English follow the same trend, as shown by the
next curves in Fig. 2 for the adjectives in Moby Dick and Ulysses. Even more unexpectedly, the
verb and adjective distributions from texts in French, Spanish, and Finnish (a highly agglutinative
language, in contrast with the other cases) display the same quantitative behaviour, as shown
in Fig. 2 and in the Supplementary Information; in all cases $\gamma$ is in the range $0.60 \pm 0.10$. It
appears that the dimensionless inter-appearance distance distribution is "universal" in the sense
of statistical physics [9]: many different systems (texts) obey the same law despite significant
differences in their "microscopic" details (style, grammatical rules). 

Although it has been suggested that Zipf's law may have an elementary explanation [7], the
distributions $D_s(\theta)$ are clearly far from trivial. They would be trivial if each word followed a Poisson
process, as happens in random texts (see Supplementary Information). Instead, the shape of the
scaling function implies a clustering phenomenon: the appearance of a word tends to "attract"
more appearances, as the distribution is dominated by a decreasing power law for $\theta < 1$. This
indicates an increased probability for small inter-appearance distances relative to the Poisson
process, which is characterized by a pure exponential distribution and approximated by $F(\theta) \simeq 1$ for
$\theta < 0.1$. The difference is clear between $\ell \simeq 10^{-4}\,\bar{\ell}_w$ and $0.1\,\bar{\ell}_w$, but beyond $\ell \simeq \bar{\ell}_w$ it is difficult to
distinguish the distribution from a Poisson process. 
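
The Poisson reference case can be illustrated with a toy random text in which a word occupies each position independently with probability p. This is an assumed stand-in for the random texts of the Supplementary Information, whose construction is not described here. For such a text the fraction of rescaled distances below 0.1 should be close to the exponential value $1 - e^{-0.1}$.

```python
import math
import random

def random_text_gaps(p, n_positions, seed=0):
    """Distances between occurrences when each position holds the word with probability p."""
    rng = random.Random(seed)
    positions = [i for i in range(n_positions) if rng.random() < p]
    return [b - a for a, b in zip(positions, positions[1:])]

def fraction_below(gaps, threshold=0.1):
    """Fraction of rescaled distances (distance / mean distance) below `threshold`."""
    mean_gap = sum(gaps) / len(gaps)
    return sum(1 for g in gaps if g / mean_gap < threshold) / len(gaps)

gaps = random_text_gaps(p=0.01, n_positions=200_000)
print("random text:", fraction_below(gaps))
print("exponential:", 1 - math.exp(-0.1))
```

Applying `fraction_below` to the distances of a real word gives a simple check for clustering: a value clearly above the exponential reference indicates an excess of short distances.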

On the other hand, the plot also shows that clustering and data collapse are not valid for very 
short distances, $\ell \lesssim 10$; rather, some anticlustering ("repulsion") shows up instead, probably due 
to grammatical and stylistic restrictions on word repetition within the same sentence. In Figs. 1 
and 2, this phenomenon is visible as a downturn in many distributions with respect to the scaling 
function on the left-hand side. 

Clustering properties of words have some striking similarities to the occurrence of natural
hazards [10, 11]; the time delay between earthquakes is shown in the lowest curve of Fig. 2 ("earthQ")
for comparison. This suggests that the models used to describe aftershock triggering may also
provide insight into the process of text generation: the appearance of a word enhances its likelihood
for a certain time, but without a characteristic scale up to the mean distance $\bar{\ell}_w$. This result also
validates Skinner's hypothesis regarding the repetitive appearance of sounds in speech [17]. 

The case of nouns and pronouns is more intricate. Their overall distributions $D_s(\theta)$ clearly
deviate from the function $F$. It turns out that for some words, inter-appearance distances
considerably larger than the mean value ($\theta \gg 1$) are more common than the scaling relation would
predict. If we remove by hand the relatively few nouns and pronouns with anomalous behaviours,
we recover the same law followed by verbs and adjectives. For instance, for Clarissa one only
needs to eliminate 12 nouns and 10 pronouns (out of 315 and 34, respectively). 

We now turn our attention to these special words. Surprisingly, they appear to follow a new
type of structure. We display in Fig. 3a the distributions for 9 nouns (letter, lady, mother, brother,
father, sister, uncle, lord, cousin) and 6 pronouns (his, your, her, him, she, he). Both groups are
divided into two frequency groups. The four distributions still collapse onto a unique curve, but
the results are clearly not fitted by the function $F$. Rather, we need a new scaling function $G$. A good
approximation is the stretched exponential function

$$G(\theta) = \frac{\delta}{a'\,\Gamma(1/\delta)}\, e^{-(\theta/a')^{\delta}}. \qquad (3)$$

Again, rescaling (so that the mean is 1) fixes one of the two parameters: $a' \simeq \Gamma(1/\delta)/\Gamma(2/\delta)$. The
remaining free parameter is $\delta = 0.33 \pm 0.05$. The scaling function $G$ also describes the clustering
properties of words, but its behaviour is different from that of $F$. Relative to a Poisson process,
these words are more likely to occur at both short distances and long distances; the clustering
effect is therefore quite pronounced. Amazingly, the function $G$ can also be used to describe the
times between ups or downs in financial markets [12]. The remaining 7 nouns and pronouns in
Clarissa, fit by neither $F$ nor $G$, are colonel, captain, sir, you, I, me, and my (see Supplementary
Information). 
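
The role of the scale parameter can be checked numerically. The sketch below assumes the normalised stretched-exponential form $G(\theta) = \delta\, e^{-(\theta/a')^{\delta}} / (a'\,\Gamma(1/\delta))$, for which the mean is $a'\,\Gamma(2/\delta)/\Gamma(1/\delta)$, so that a unit mean requires $a' = \Gamma(1/\delta)/\Gamma(2/\delta)$. The helper `moment` and its integration grid are illustrative choices.

```python
import math

def stretched_exponential(theta, delta=0.33):
    """Normalised stretched exponential with the scale chosen to give unit mean."""
    a = math.gamma(1 / delta) / math.gamma(2 / delta)
    norm = delta / (a * math.gamma(1 / delta))
    return norm * math.exp(-((theta / a) ** delta))

def moment(f, k, lo=-40.0, hi=math.log(1e4), n=20000):
    """Midpoint estimate of the k-th moment of f, integrating in u = log(theta)."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        theta = math.exp(lo + (i + 0.5) * h)
        total += f(theta) * theta ** (k + 1)   # extra factor theta from d(theta) = theta du
    return total * h

print("normalisation:", moment(stretched_exponential, 0))
print("mean:", moment(stretched_exponential, 1))
for theta in (1e-3, 10.0):
    print(theta, stretched_exponential(theta), math.exp(-theta))
```

Both moments should come out close to one, and the last two lines compare $G$ with the unit-mean exponential at a short and a long rescaled distance.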

We can proceed by considering what these laws mean for individual words. If we can accurately
measure a number of single-word probability densities $D_w(\ell)$, then we can compare their shapes
by means of the scale transformation $\ell \to \ell/\bar{\ell}_w = \theta$ and $D_w(\ell) \to \bar{\ell}_w D_w(\ell)$. A sufficient condition
for the collapse of $D_s(\theta)$ to hold is that

$$D_w(\ell) = G(\ell/\bar{\ell}_w)/\bar{\ell}_w \qquad (4)$$

(the same holds for $F$). This scaling law is shown to be very accurate in Fig. 3b, which displays
the individual distributions for the nouns and pronouns whose averaged distributions were shown
in Fig. 3a. The extension of this law to other texts and languages is demonstrated in Fig. 3c, with
impressive results. 

If we now relate the mean distance to Zipf's law (1), $\bar{\ell}_w \simeq 1/f_w \propto r_w^{\alpha}$, then Eq. (4) becomes

$$D_w(\ell) = \hat{G}(\ell/r_w^{\alpha})/r_w^{\alpha}, \qquad (5)$$

where $\hat{G}$ is essentially $G$ after including the normalization constant of Zipf's law. This is just
the condition of scale invariance for functions with two variables [18], and reflects the signature
characteristic of word repetition: each word follows the same pattern, although on a different
scale which depends on its average frequency. In other words, different parts of a text (word
occurrences) have the same structure despite great differences in their scales (this is also the main 
characteristic of fractals). In statistical physics, distinct scaling relations such as F and G (or 
distinct values of the exponent a) define universality classes [9]. In linguistics these universality 
classes comprise different types of words, independent of author and language. 
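
The step from Eq. (4) to Eq. (5) can be written out explicitly. The constant $A$ below is a name introduced here for the normalization constant of Zipf's law, and the hat on the rescaled function is notation assumed for this sketch.

```latex
% Assumption: Zipf's law (1) written with an explicit normalization constant A
f_w = A\, r_w^{-\alpha}
\quad\Longrightarrow\quad
\bar{\ell}_w \simeq \frac{1}{f_w} = \frac{r_w^{\alpha}}{A}

% Substituting the mean distance into the single-word scaling law (4)
D_w(\ell) = \frac{1}{\bar{\ell}_w}\, G\!\left(\frac{\ell}{\bar{\ell}_w}\right)
          = \frac{A}{r_w^{\alpha}}\, G\!\left(\frac{A\,\ell}{r_w^{\alpha}}\right)
          \equiv \frac{1}{r_w^{\alpha}}\, \hat{G}\!\left(\frac{\ell}{r_w^{\alpha}}\right),
\qquad \hat{G}(x) = A\, G(A\,x)
```

The distribution of a single word therefore depends on the distance and the rank only through the combination $\ell/r_w^{\alpha}$, apart from the prefactor $1/r_w^{\alpha}$.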



Going back to the case of adjectives and verbs, it turns out that the single-word distributions are
described by $F$ only on average; that is, in many cases there are deviations between their rescaled
$D_w(\ell)$ and $F(\theta)$. The small deviations appearing at large $\theta$ in Fig. 2 may originate from special
cases, for instance the word belle (beautiful) in Artamene. The distribution of this word actually
scales nicely with $G$, as seen in Fig. 3c. Nevertheless, such cases are rare and their statistical
weight is so low that they barely modify the shape of $F$. 

We may wonder if the universality classes that describe word clustering are an intrinsic property
of language itself, or rather a fundamental characteristic of human behaviour reflected in literary
works [19, 20]. The first possibility is favoured by the fact that adverbs and even function words
(conjunctions, prepositions, and determiners) clearly do not follow a Poisson process, except
perhaps for those words with the very lowest ranks (see Supplementary Information). Higher-rank
adverbs and function words display clustering, and are well described by the scaling function $F$. 

However, we can establish at least one important distinction between the universality classes
we have found. It turns out that most of the "special" nouns distributed according to $G$ refer to
persons with a particularly relevant role (note that nouns not referring to persons can also play
an important role, as is the case with letter in Clarissa, an epistolary novel). Something similar
is clearly happening with pronouns, as the special cases are always personal or possessive. We
observe increased clustering for these key words, although unlike Ref. [21] we find well-defined
universality classes. This supports the idea that this kind of clustering originates in the special
properties of human behaviour [19, 20]. 

METHODS SUMMARY 

Identification of parts of speech. English words were placed into grammatical categories
using A. Kilgarriff's lemmatised list, elaborated from the British National Corpus [22]. However,
we have changed the classification of possessive determiners to possessive pronouns [15] as their
behaviour is consistent with the clustering properties obtained for other pronouns. Words
belonging to more than one category were excluded from the study. For Spanish words, we mainly used
the Wiktionary from Wikipedia [23] but also drew on the electronic dictionary built by Padro
et al. for FreeLing [24]. For French, we made use of the list elaborated by S. Sharoff [25], and
Finnish words were identified from a list available at the CSC, the Finnish IT Centre for Science
[26].
METHODS 

Titles analyzed. The following table provides details on the texts. The labels (column 1) are 
used in Fig. 2 to identify the curves. The date given is the year of first publication, and the length 
of texts is in millions of words (Mw). Electronic versions of the texts were downloaded from the 
Gutenberg Project's web page [1], except for Artamene [2]. 



label  title                     author            language  year  length (Mw)

Clar   Clarissa                  S. Richardson     English   1748  0.976
Moby   Moby Dick                 H. Melville       English   1851  0.215
Uly    Ulysses                   J. Joyce          English   1918  0.269
Arta   Artamene                  Scudery siblings  French    1649  2.088
Brag   Le Vicomte de Bragelonne  A. Dumas          French    1847  0.699
DonQ   Don Quijote               M. Cervantes      Spanish   1605  0.381
LaRe   La Regenta                L. A. "Clarín"    Spanish   1884  0.308
Keva   Kevät ja Takatalvi        J. Aho            Finnish   1906  0.114











[1] http://www.gutenberg.org. 
[2] http://www.artamene.org. 



FIG. 1: Scaling and clustering of inter-appearance distance distributions. Verbs appearing in their
root form in Clarissa are considered. Each distribution includes all words falling into one of six equal
logarithmic ranges of absolute frequency $n_w$. All the distributions are well described by a unique shape:
the gamma distribution $F$ explained in the text with $\gamma = 0.60$ and $a = 1/\gamma$ (solid line). The exponential
function, characteristic of Poisson processes, is shown for comparison. 

FIG. 2: Universality of inter-appearance distance distributions. The adjectives in several novels in
different languages are analysed (see Methods), except for Finnish ("Keva"), where we have considered
verbs, due to their better statistics. From top to bottom the distributions are multiplied by 1, $10^{-2}$, $10^{-4}$, and
so on, to avoid overlapping the curves. Recurrence-time distributions for earthquakes (earthQ) in Southern
California are included at the bottom for comparison (the distributions include all events with magnitude
$M > M_c$ between 1995 and 1998, where $M_c$ = 2, 2.5, 3, and 3.5 [10]). 



FIG. 3: Scaling in a different universality class for special nouns and pronouns. (a) Average rescaled
probability densities for the 9 nouns and 6 pronouns from Clarissa listed in the text, where each group is
divided into two frequency ranges. A stretched exponential function $G$ with $\delta = 0.33$ describes all four
distributions well. The scaling function $F$ is also displayed for comparison. (b) Corresponding rescaled
distance distributions for individual words. (c) Rescaled distance distributions for words from other works,
including the adjective belle. 